data processing pipeline
Wasm: A Pipeline for Constructing Structured Arabic Interleaved Multimodal Corpora
Hennara, Khalil, Bastati, Ahmad, Hreden, Muhammad, Hamed, Mohamed Motasim, Aldallal, Zeina, Chrouf, Sara, AlModhayan, Safwan
The performance of large language models (LLMs) and large multimodal models (LMMs) depends heavily on the quality and scale of their pre-training datasets. Recent research shows that large multimodal models trained on natural documents where images and text are interleaved outperform those trained only on image-text pairs across a wide range of benchmarks, leveraging advanced pre-trained models to enforce semantic alignment, image-sequence consistency, and textual coherence. For Arabic, however, the lack of high-quality multimodal datasets that preserve document structure has limited progress. In this paper, we present our pipeline Wasm for processing the Common Crawl dataset to create a new Arabic multimodal dataset that uniquely provides markdown output. Unlike existing Arabic corpora that focus solely on text extraction, our approach preserves the structural integrity of web content while maintaining flexibility for both text-only and multimodal pre-training scenarios. We provide a comprehensive comparative analysis of our data processing pipeline against those used for major existing datasets, highlighting the convergences in filtering strategies and justifying our specific design choices. To support future research, we publicly release a representative dataset dump along with the multimodal processing pipeline for Arabic.
DrugMCTS: a drug repurposing framework combining multi-agent, RAG and Monte Carlo Tree Search
Yang, Zerui, Wan, Yuwei, Yan, Siyu, Matsuda, Yudai, Xie, Tong, Hoex, Bram, Song, Linqi
Recent advances in large language models have demonstrated considerable potential in scientific domains such as drug repositioning. However, their effectiveness remains constrained when reasoning extends beyond the knowledge acquired during pre-training. Conventional approaches, such as fine-tuning or retrieval-augmented generation, face limitations in either imposing high computational overhead or failing to fully exploit structured scientific data. To overcome these challenges, we propose DrugM-CTS, a novel framework that synergistically integrates RAG, multi-agent collaboration, and Monte Carlo Tree Search for drug repositioning. The framework employs five specialized agents tasked with retrieving and analyzing molecular and protein information, thereby enabling structured and iterative reasoning. Extensive experiments on the DrugBank and KIBA datasets demonstrate that DrugMCTS achieves substantially higher recall and robustness compared to both general-purpose LLMs and deep learning baselines. Our results highlight the importance of structured reasoning, agent-based collaboration, and feedback-driven search mechanisms in advancing LLM applications for drug repositioning.
TouchTTS: An Embarrassingly Simple TTS Framework that Everyone Can Touch
Song, Xingchen, Xing, Mengtao, Ma, Changwei, Li, Shengqiang, Wu, Di, Zhang, Binbin, Pan, Fuping, Zhou, Dinghao, Zhang, Yuekai, Lei, Shun, Peng, Zhendong, Wu, Zhiyong
It is well known that LLM-based systems are data-hungry. Recent LLM-based TTS works typically employ complex data processing pipelines to obtain high-quality training data. These sophisticated pipelines require excellent models at each stage (e.g., speech denoising, speech enhancement, speaker diarization, and punctuation models), which themselves demand high-quality training data and are rarely open-sourced. Even with state-of-the-art models, issues persist, such as incomplete background noise removal and misalignment between punctuation and actual speech pauses. Moreover, the stringent filtering strategies often retain only 10-30\% of the original data, significantly impeding data scaling efforts. In this work, we leverage a noise-robust audio tokenizer (S3Tokenizer) to design a simplified yet effective TTS data processing pipeline that maintains data quality while substantially reducing data acquisition costs, achieving a data retention rate of over 50\%. Beyond data scaling challenges, LLM-based TTS systems also incur higher deployment costs compared to conventional approaches. Current systems typically use LLMs solely for text-to-token generation, while requiring separate models (e.g., flow matching models) for token-to-waveform generation, which cannot be directly executed by LLM inference engines, further complicating deployment. To address these challenges, we eliminate redundant modules in both LLM and flow components, replacing the flow model backbone with an LLM architecture. Building upon this simplified flow backbone, we propose a unified architecture for both streaming and non-streaming inference, significantly reducing deployment costs. Finally, we explore the feasibility of unifying TTS and ASR tasks using the same data for training, thanks to the simplified pipeline and the S3Tokenizer that reduces the quality requirements for TTS training data.
Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content
Wang, Qiuheng, Shi, Yukai, Ou, Jiarong, Chen, Rui, Lin, Ke, Wang, Jiahao, Jiang, Boyuan, Yang, Haotian, Zheng, Mingwu, Tao, Xin, Yang, Fei, Wan, Pengfei, Zhang, Di
As visual generation technologies continue to advance, the scale of video datasets has expanded rapidly, and the quality of these datasets is critical to the performance of video generation models. We argue that temporal splitting, detailed captions, and video quality filtering are three key factors that determine dataset quality. However, existing datasets exhibit various limitations in these areas. To address these challenges, we introduce Koala-36M, a large-scale, high-quality video dataset featuring accurate temporal splitting, detailed captions, and superior video quality. The core of our approach lies in improving the consistency between fine-grained conditions and video content. Specifically, we employ a linear classifier on probability distributions to enhance the accuracy of transition detection, ensuring better temporal consistency. We then provide structured captions for the splitted videos, with an average length of 200 words, to improve text-video alignment. Additionally, we develop a Video Training Suitability Score (VTSS) that integrates multiple sub-metrics, allowing us to filter high-quality videos from the original corpus. Finally, we incorporate several metrics into the training process of the generation model, further refining the fine-grained conditions. Our experiments demonstrate the effectiveness of our data processing pipeline and the quality of the proposed Koala-36M dataset. Our dataset and code will be released at https://koala36m.github.io/.
WanJuan-CC: A Safe and High-Quality Open-sourced English Webtext Dataset
Qiu, Jiantao, Lv, Haijun, Jin, Zhenjiang, Wang, Rui, Ning, Wenchang, Yu, Jia, Zhang, ChaoBin, Li, Zhenxiang, Chu, Pei, Qu, Yuan, Shi, Jin, Lu, Lindong, Peng, Runyu, Zeng, Zhiyuan, Tang, Huanze, Lei, Zhikai, Hong, Jiawei, Chen, Keyu, Fei, Zhaoye, Xu, Ruiliang, Li, Wei, Tu, Zhongying, Dahua, Lin, Qiao, Yu, Yan, Hang, He, Conghui
This paper presents WanJuan-CC, a safe and high-quality open-sourced English webtext dataset derived from Common Crawl data. The study addresses the challenges of constructing large-scale pre-training datasets for language models, which require vast amounts of high-quality data. A comprehensive process was designed to handle Common Crawl data, including extraction, heuristic rule filtering, fuzzy deduplication, content safety filtering, and data quality filtering. From approximately 68 billion original English documents, we obtained 2.22T Tokens of safe data and selected 1.0T Tokens of high-quality data as part of WanJuan-CC. We have open-sourced 100B Tokens from this dataset. The paper also provides statistical information related to data quality, enabling users to select appropriate data according to their needs. To evaluate the quality and utility of the dataset, we trained 1B-parameter and 3B-parameter models using WanJuan-CC and another dataset, RefinedWeb. Results show that WanJuan-CC performs better on validation datasets and downstream tasks.
YAYI 2: Multilingual Open-Source Large Language Models
Luo, Yin, Kong, Qingchao, Xu, Nan, Cao, Jia, Hao, Bao, Qu, Baoyu, Chen, Bo, Zhu, Chao, Zhao, Chenyang, Zhang, Donglei, Feng, Fan, Zhao, Feifei, Sun, Hailong, Yang, Hanxuan, Pan, Haojun, Liu, Hongyu, Guo, Jianbin, Du, Jiangtao, Wang, Jingyi, Li, Junfeng, Sun, Lei, Liu, Liduo, Dong, Lifeng, Liu, Lili, Wang, Lin, Zhang, Liwen, Wang, Minzheng, Wang, Pin, Yu, Ping, Li, Qingxiao, Yan, Rui, Zou, Rui, Li, Ruiqun, Huang, Taiwen, Wang, Xiaodong, Wu, Xiaofei, Peng, Xin, Zhang, Xina, Fang, Xing, Xiao, Xinglin, Hao, Yanni, Dong, Yao, Wang, Yigang, Liu, Ying, Jiang, Yongyu, Wang, Yungan, Wang, Yuqi, Wang, Zhangsheng, Yu, Zhaoxin, Luo, Zhen, Mao, Wenji, Wang, Lei, Zeng, Dajun
As the latest advancements in natural language processing, large language models (LLMs) have achieved human-level language understanding and generation abilities in many real-world tasks, and even have been regarded as a potential path to the artificial general intelligence. To better facilitate research on LLMs, many open-source LLMs, such as Llama 2 and Falcon, have recently been proposed and gained comparable performances to proprietary models. However, these models are primarily designed for English scenarios and exhibit poor performances in Chinese contexts. In this technical report, we propose YAYI 2, including both base and chat models, with 30 billion parameters. YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline. The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback. Extensive experiments on multiple benchmarks, such as MMLU and CMMLU, consistently demonstrate that the proposed YAYI 2 outperforms other similar sized open-source models.
Event-Driven Scalability in Data Processing Pipeline
Building a data processing pipeline is one of the most common problem statements, for which you would have written small scripts or built a full-fledged scalable system based on the amount, and frequency of data. In this article, we will talk about the idea of event-driven scalability, the backbone that will be cost-optimized, and requires a minimum amount of development and operations. Why build an event-driven and scalable data processing pipeline? While working with startups or building a team project or a personal project, that requires a pipeline for data processing, there is always a constraint of cost. We will use a simple example for building a metadata extraction system for e-commerce products.
Operationalizing Machine Learning from PoC to Production - KDnuggets
Many companies use machine learning to help create a differentiator and grow their business. However, it's not easy to make machine learning work as it requires a balance between research and engineering. One can come up with a good innovative solution based on current research, but it might not go live due to engineering inefficiencies, cost and complexity. Most companies haven't seen much ROI from machine learning since the benefit is realized only when the models are in production. Let's dive into the challenges and best practices that one can follow to make machine learning work.
NVIDIA Brings The Power Of GPU To Data Processing Pipelines
NVIDIA has launched an open source project called Real-time Acceleration Platform for Integrated Data Science (RAPIDS) that aims to deliver end-to-end data science infrastructure based on GPUs. GPU-backed machines play an essential role in generating machine learning models. Data scientists run training jobs that are computationally intensive on GPUs. Massive datasets that are converted into complex matrices of numbers are used as an input for machine learning and deep learning models. During the training process, these matrices are added, multiplied and subtracted from other complex matrices.
Data Quality in the era of A.I. – Towards Data Science – Medium
As the director of datamine decision support systems, I've delivered more than 80 data-intensive projects -- including data warehousing, data integration, business intelligence, content performance and predictive models -- across several industries and high-profile corporations. In most cases, data quality proved to be a critical success factor. The obvious challenge in every case was to effectively query heterogeneous data sources, then extract and transform data towards one or more data models. The non-obvious challenge was the early identification of data issues, which in most cases were unknown to the data owners as well. There are many aspects to data quality, including consistency, integrity, accuracy, and completeness.